Abstract. Advances in methods of structure determination
have led to the accumulation of large amounts of
protein structural data. Some 500 distinct protein folds 
have now been characterized, representing one-third of 
all globular folds that exist. The range of known struc- 
tural types and the relatively large fraction of the 
protein universe that has already been sampled have 
greatly facilitated the discovery of some unifying princi- 
ples governing protein structure and evolutionary rela- 
tionships. These include a highly skewed distribution of 
topological arrangements of secondary-structure elements
that favors a few very common connectivities and
a highly skewed distribution in the capacity of folds to 
accommodate unrelated sequences. These and other ob- 
servations suggest that the number of folds is far fewer 
than the number of genes, and that the fold universe is 
dominated by a small number of giant attractors that 
accommodate large numbers of unrelated sequences. 
Thus all basic protein folds will likely be determined in 
the near future, laying the foundation for a comprehen- 
sive understanding of the biochemical and cellular func- 
tions of whole organisms. 



Introduction 

More than 10,000 protein structures are now in the 
Protein Data Bank (PDB) [1]. These proteins cluster into
approximately 1200 sequence families, and the families 
in turn are distributed among some 500 folds (SCOP [2] 
version 1.48). Equally striking is the rate at which 
additional structures are being determined; e.g., the past 
2 years have seen 458 new sequence families added to 
the PDB. Although the number of sequence families in 
nature is probably orders of magnitude larger than the 
number in the PDB, many of the families that are 
currently not represented will turn out to have folds 
that are already known. In fact, as we show below, the 
number of unknown folds is only twice the number of 
folds currently known. This relatively small number of
total folds, the increasing rate at which structures are 
being determined, and the discovery of structural prin- 
ciples governing function all point toward the expecta- 
tion of dramatic near-term advances in our 
understanding of cellular life in molecular terms. 



Structural systematics

Progress in uncovering regularities in protein structure
has come in part from the abundance of
structures and in part from the development of effective
phenetic classifications of structures [2-5]. These
classifications build upon the observation that globular 
proteins are organized as a structural hierarchy [6]. At 
the base of the hierarchy are the regular secondary 
structures, e.g., α-helices and β-strands, where consecutive
residues adopt similar backbone conformations.
Tertiary structure is then formed by packing secondary 
structural elements into one or several compact globu- 
lar units called domains [7]. Some proteins contain 
several polypeptide chains arranged in a quaternary 
structure. 

The rigid framework formed by secondary structures is 
the best-defined part of a protein structure. The spatial 
organization of secondary structural elements, or topol- 
ogy, has been the primary means by which protein 
structures and their commonalities are characterized 
and classified [8, 9]. For example, the SCOP [2] data- 
base places protein domains in the same fold category if 
they have the same secondary structure elements in the 
same order, with the same topologies. Figure 1 shows
the 15 most populated folds defined using this criterion.
A recent comparison of SCOP with two other widely
cited databases, FSSP [4] and CATH [5], indicates
that proteins assigned similar folds in one database are
generally assigned similar folds in the others [10]. The
overall agreement suggests the existence of a natural 
logic in structural classification. 

In addition to introducing order to the growing volume 
of structural data, the phenetic descriptions of protein 
structure also provide powerful clues to evolutionary 
relationships [11, 12]. Empirical observations support 
the notion that structure is more robust than sequence 
[13-16]. Thus, proteins that have diverged beyond sig- 
nificant sequence similarity still retain the three-dimen- 
sional fold of their ancestors. There are many examples 
where remote homology relationships that are hidden at
the sequence level have been revealed by structure
comparison. When proteins are mapped onto an abstract
space such that similar proteins are neighbors, proteins
unified by sequence similarity (protein families) and
proteins unified by structural similarity (folds) form a
nested neighborhood structure (fig. 2). Some protein
families that belong to the same fold may be further
grouped into superfamilies based on their shared
ancestry, despite low sequence similarity. Note that proteins
of independent origin may well have similar structures
for purely physicochemical reasons.

Known protein folds differ markedly in the number of
sequence families they can accommodate [4, 17, 18].
Although a majority of the folds have only one or two
representatives in the current set of structurally
characterized proteins, a small number of folds are associated
with many unrelated families of sequences. Below, we
describe a robust estimate of the total number of folds
[19] which agrees quantitatively with this observation.

Figure 1. The 15 most populated folds. They were selected on the basis of a structural annotation of proteins from completely
sequenced genomes of 20 bacteria, five Archaea, and three eukaryotes [C. Zhang, unpublished data]. From left to right and top to
bottom, they are: ferredoxin-like (4.45%) (A), TIM-barrel (3.94%) (B), P-loop containing nucleotide triphosphate hydrolase (3.71%) (C),
protein kinases (PK) catalytic domain (3.14%) (D), NAD(P)-binding Rossmann-fold domains (2.80%) (E), DNA/RNA-binding 3-helical
bundle (2.60%) (F), α-α superhelix (1.95%) (G), S-adenosyl-L-methionine-dependent methyltransferase (1.92%) (H), 7-bladed beta-propeller
(1.85%) (I), α/β-hydrolases (1.84%) (J), PLP-dependent transferase (1.61%) (K), adenine nucleotide α-hydrolase (1.59%) (L),
flavodoxin-like (1.49%) (M), immunoglobulin-like β-sandwich (1.38%) (N), and glucocorticoid receptor-like (0.97%) (O), where the
values in parentheses are the percentages of annotated proteins adopting the respective folds.

Figure 2. The nested neighborhood structure of families and folds
in an abstract protein space. Protein families are represented by
filled circles, folds by rectangles. Families with evidence of common
ancestry are grouped into superfamilies (enclosed by dotted
lines).



How many folds are there?

To be specific, we use the SCOP (release 1.48) classification
as a standard; the use of other classifications gives
similar results. The breakdown of folds by the number
of families in the current structural database follows a
geometric distribution. More specifically, on a plot of
the number of globular folds that contain m sequence
families against m, the distribution drops exponentially
for m ≤ 6 (fig. 3). If we take the protein families whose
structures are known as a random sample from the pool
of all sequence families, the observed distribution is best
explained if the distribution of protein families among
folds in the universe is also geometric [19]. The total
number of folds is

$$ N = \frac{M_s N_s}{M_s - (1 - M_s/M)\,N_s} \qquad (1) $$

where M_s and N_s are the observed numbers of sequence
families and folds, respectively, and M is the total
number of sequence families in nature.

To apply equation 1, it is important to first remove all
folds with more than six sequence families because they
belong to the non-exponential tail of the observed
distribution. This leaves us with N_s = 477 folds that cover
M_s = 771 families. Because M ≫ M_s, the number of
folds can be approximated by N = 771 × 477/(771 − 477)
≈ 1250. Figure 3 shows that the theoretical sampling
distribution calculated using N = 1250 matches
the observed distribution remarkably well. Adding back
the 33 superfolds (containing 423 families) that were
removed places the final estimate of N at about 1300.

Figure 3. The breakdown of protein folds by the number of
sequence families (m). Solid triangles represent the observed
distributions based on the SCOP database (only the first six terms
are shown). The dashed line represents the theoretical distribution
based on the assumption that the sequence families whose
structures have been determined are a random sample from the pool of
all existing sequence families. Horizontal axis: the number of
sequence families per fold, m. For details, see Zhang and DeLisi
[19].
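
The arithmetic behind the fold estimate can be checked with a short script. This is only a sketch under stated assumptions: it takes equation 1 in the form N = M_s N_s / (M_s − (1 − M_s/M) N_s), uses the counts quoted in the text (477 folds, 771 families, 33 superfolds), and the function name `estimate_total_folds` is a hypothetical helper, not something taken from reference [19].

```python
from typing import Optional


def estimate_total_folds(m_s: float, n_s: float,
                         m_total: Optional[float] = None) -> float:
    """Estimate the total number of folds N from sampled counts.

    m_s     -- observed number of sequence families (M_s)
    n_s     -- observed number of folds (N_s)
    m_total -- total number of sequence families in nature (M);
               None is treated as the limit M >> M_s, where the
               factor (1 - M_s/M) becomes 1.
    """
    factor = 1.0 if m_total is None else 1.0 - m_s / m_total
    denominator = m_s - factor * n_s
    if denominator <= 0:
        raise ValueError("the sample gives no finite estimate of N")
    return m_s * n_s / denominator


# Counts quoted in the text, after setting aside folds with more than
# six sequence families.
n_core = estimate_total_folds(m_s=771, n_s=477)

# Add back the 33 superfolds that were removed before the fit.
n_all = n_core + 33

print(f"exponential part: {n_core:.0f} folds; with superfolds: {n_all:.0f}")
```

Passing a finite `m_total` lowers the estimate slightly, which shows why the approximation M ≫ M_s matters little as long as the PDB covers only a small share of all sequence families.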
Given the total number of sequence families M and the
total number of folds N, the relationship between the
number of protein families chosen for structure determination, M_s,
and the fraction of folds they represent, λ = N_s/N, is
given by [19]

$$ \lambda = \frac{M_s}{M_s + (1 - M_s/M)\,N} \qquad (2) $$


Because of the skewed distribution of protein families 
among folds, complete elucidation of all the folds by the 
default strategy (random family selection) still means 
solving the structures of virtually all non-homologous 
proteins, despite the fact that the number of folds in 
nature is limited. 

As more protein structures are solved, novel folds will 
continue to be observed. A more practical question is 
how long it will take to obtain structural sketches for a
majority, say 90%, of the folds. Based on equation
2, and again assuming M ≫ M_s, to uncover 90% of the
folds requires the structural determination of represen- 
tatives from 12,000 randomly chosen protein families. 
According to SCOP, the structures of 458 new protein 
sequence families have been reported over the past 2 
years. If this rate continues, it will take approximately 
50 years to identify 90% of the folds. To reduce this
figure to 10 years, we have to increase the rate of family
structural determination by about 20% per year. The
alternative is to select new sequences for structural
determination in accordance with a definite strategy
that maximizes the chance of uncovering a new fold.
Such strategies can be developed [20, 21], but their 
implementation will require cooperation among the 
community of structural biologists, at a level similar to 
that developed in the genomics community during the 
past decade. The potential payoff could be the solution 
to the protein-folding problem during the next decade. 
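
The sampling figures quoted above can be reproduced with a few lines of code. The sketch below assumes that the fraction of folds represented is N_s/N = M_s / (M_s + (1 − M_s/M) N), takes N ≈ 1300 from the fold estimate, and works in the limit M ≫ M_s. The helper names are hypothetical, and subtracting the roughly 1200 families already in the PDB before dividing by the rate is an assumption of this sketch rather than a step stated in the text.

```python
from typing import Optional


def fold_coverage(m_s: float, n_total: float,
                  m_total: Optional[float] = None) -> float:
    """Fraction of all folds represented by m_s randomly chosen families."""
    factor = 1.0 if m_total is None else 1.0 - m_s / m_total
    return m_s / (m_s + factor * n_total)


def families_needed(coverage: float, n_total: float) -> float:
    """Invert the coverage relation in the limit M >> M_s.

    coverage = M_s / (M_s + N)  =>  M_s = coverage * N / (1 - coverage)
    """
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must lie strictly between 0 and 1")
    return coverage * n_total / (1.0 - coverage)


N_TOTAL = 1300          # estimated total number of folds
KNOWN_FAMILIES = 1200   # sequence families already in the PDB
RATE = 458 / 2          # new families per year over the past 2 years

needed = families_needed(0.90, N_TOTAL)
years = (needed - KNOWN_FAMILIES) / RATE

print(f"families needed for 90% of folds: {needed:.0f}")
print(f"years at the current rate: {years:.0f}")
```

Because the required number of families grows as coverage/(1 − coverage), pushing the target from 90% to 99% multiplies the sampling effort about elevenfold, which is the quantitative reason random selection is a poor route to complete coverage.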



Why do proteins prefer a small number of folds? 

Structural regularities of proteins have long been recog- 
nized to be not only present at the whole protein (or 
domain) level, but also at the substructural level [6]. 
Secondary structure elements are observed to combine 
in specific geometric arrangements. The three basic supersecondary
structural motifs, α-hairpin, β-hairpin,
and βαβ-unit (fig. 4), are the simplest examples of such
regularities [9, 22, 23]. These motifs are found more
frequently in superfolds than in other folds [24], sug- 
gesting a high degree of correlation between the simplic- 
ity of secondary structure arrangement and the capacity 
of the fold. In general, protein structures tend to place
sequential structural elements adjacent to one another in
three-dimensional space. There are indications that
such placements support rapid and convenient folding.




Figure 4. Supersecondary structural motifs: an α-hairpin (A), a
β-hairpin (B), and a βαβ-unit (C).



For example, a significant correlation has been found 
between the folding rate of small proteins and the 
average sequence separation between contacting 
residues in the native state [25]. 
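
The quantity referred to here, the average sequence separation between contacting residues, is simple to compute from coordinates. The following is a minimal sketch, not the procedure of reference [25]: it assumes one representative point per residue (for instance the Cα atom), and the 8 Å distance cutoff and the function name are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def mean_contact_separation(coords: Sequence[Point],
                            cutoff: float = 8.0,
                            min_sep: int = 1) -> float:
    """Average |i - j| over residue pairs closer than `cutoff`.

    coords  -- one 3D point per residue, in sequence order
    cutoff  -- contact distance threshold (assumed value, in angstroms)
    min_sep -- smallest sequence separation counted as a contact
    """
    total_separation = 0
    n_contacts = 0
    for i in range(len(coords)):
        for j in range(i + min_sep, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                total_separation += j - i
                n_contacts += 1
    if n_contacts == 0:
        raise ValueError("no contacts found at this cutoff")
    return total_separation / n_contacts


# Toy geometries (not real proteins): an extended chain and a chain
# that folds back on itself like a hairpin.
extended = [(3.8 * i, 0.0, 0.0) for i in range(5)]
hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
           (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0)]

print(mean_contact_separation(extended))
print(mean_contact_separation(hairpin))
```

Dividing the result by the chain length gives a length-normalized value, which is convenient when comparing proteins of different sizes.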

Analysis of larger substructural motifs was expected to 
reveal more about the topological preference of 
proteins, but the identification of such motifs was previously
impractical because available data were limited
[26]. The exception was the Greek key motif [27], which
was noticed shortly after the first few β-sandwich structures
had been solved [27, 28]. A Greek key motif
contains four consecutive antiparallel β-strands with
the first strand hydrogen bonding to the last strand (fig.
5). In particular, a composite form that consists of two
Figure 5. The Greek key motif. At the top are the two forms of
the Greek key motif. A composite motif consisting of two overlapping
Greek keys, shown at the bottom, is the structural determinant
of β-sandwiches. For details, see Zhang and Kim [29].

Figure 6A-D. The most common four-stranded β-sheet motifs in
open-faced β-sheet structures.



overlapping Greek keys (fig. 5) is present in all known
β-sandwiches [29]. Considering the enormous number
of topologies that could be generated by two intertwining
β-sheets, the prevalence of this unit is striking. It is
tempting to speculate that the unit may play an important
role in the folding of β-sandwich structures.
Although the exact folding mechanisms of β-sandwiches
remain unknown, recent experimental work has
shown that several Ig-like β-sandwich proteins do appear
to share a common folding pathway [30]. The
folding nucleus residues of one of the Ig-like protein
domains (FN3) were indeed mapped onto the composite
Greek key motif [31]. In general, the hypothesis
that the folding mechanism of a protein is dependent 
primarily on the topology of the native state rather than 
on specific details of the sequence has gained increasing 
support from the experimental studies of protein fold- 
ing [32, 33]. 

With the amount of structural data currently available, 
our understanding of the topological preferences of 
substructural motifs has been greatly extended; some 
general principles can now be formulated. In particular, 
more than 50% of protein domains consist of an open-faced
β-sheet flanked with helices or loops on either or
both sides (see fig. 1 for examples). In these structures,
the topologies of the β-sheets often determine the
topologies of the entire folds. A recent survey of these
structures indicates that open-faced β-sheets with more
than four β-strands usually contain at least one
four-stranded β-sheet substructure [34]. The β-sheet substructures
thus extracted were used to analyze the
topological preferences of all 96 possible four-stranded
β-sheet topologies. Of the 42 topologies that have been
observed, four (fig. 6) account for 50% of the open-faced
β-sheet structures currently known. With the exception
of the simple up-and-down meander topology
(fig. 6D), the other three topologies all have at least one
pair of consecutive β-strands separated by other
strands. In particular, the double-stranded crossover
motif (fig. 6A) has two βαβ split crossovers [35]. As
with the Greek key motif, the high frequencies of these
motifs reflect an inherent bias in the natural usage of
β-sheet motifs.
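
The figure of 96 possible four-stranded topologies can be recovered by direct enumeration. The sketch below rests on a counting convention that is an assumption here, since the text does not spell it out: a topology is the spatial order of the strands in the sheet together with each strand's direction, the first strand in the sequence is taken to point "up", and a spatial order and its mirror image are treated as the same sheet.

```python
from itertools import permutations, product
from typing import Set, Tuple

Topology = Tuple[Tuple[int, ...], Tuple[int, ...]]


def sheet_topologies(n_strands: int = 4) -> Set[Topology]:
    """Enumerate distinct topologies of a single open sheet.

    Each topology is (order, directions):
      order      -- strand indices (numbered along the sequence) listed
                    by spatial position across the sheet
      directions -- +1 or -1 per strand, indexed by sequence number,
                    with strand 0 fixed at +1
    """
    topologies: Set[Topology] = set()
    for order in permutations(range(n_strands)):
        # A sheet read left-to-right or right-to-left is the same sheet.
        canonical = min(order, order[::-1])
        for rest in product((1, -1), repeat=n_strands - 1):
            topologies.add((canonical, (1,) + rest))
    return topologies


all_four = sheet_topologies(4)

# The up-and-down meander: strands adjacent in sequence are adjacent
# in space and alternate in direction.
meander: Topology = ((0, 1, 2, 3), (1, -1, 1, -1))

print(len(all_four))
print(meander in all_four)
```

Under this convention the count is (4!/2) × 2³, so the 42 observed topologies cover somewhat less than half of the combinatorial possibilities.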

Most of the unobserved four-stranded β-sheet topologies
fall into two groups [19]. The first group contains
topologies with alternating parallel and antiparallel
β-ladders (a β-ladder being a pair of adjacent β-strands hydrogen
bonded to each other). Their rare occurrence reflects the
expectation that matching different hydrogen-bonding
patterns is energetically unfavorable. The topologies in
the second group have complex traces and may require
a specific sequence of steps during folding [36, 37]; this
may result in low designability [38]. Taken together,
these two groups of topologies may represent a section
of topological space that is not readily accessible to
proteins. This indicates that we have already seen a
majority of the four-stranded β-sheet topologies that
exist in nature, and most of the topologies that have not
been identified may have never occurred.

Larger protein structures currently known in general
lack the diversity in the overall topological patterns that 
would be expected if the constituent secondary struc- 
tural elements were arranged freely. A majority of these 
structures utilize recurrent substructural units as the 
core building blocks [9, 26, 29, 34]. Therefore, the 
topological biases at the substructural level directly 
influence the diversity of protein folds. Understanding 
these biases and the underlying physics helps in under- 
standing why proteins prefer only a small fraction of 
the structural patterns. 



Folds, functions, and pathways 

The native structure is an absolute requirement for 
protein function. Although knowing the fold alone usu- 
ally does not give definite answers to all questions 
regarding function [39], the rather small number of 
basic protein folds provides a concise and powerful 
framework to organize the far larger number of biolog- 
ical functions needed by a living cell [40]. The major 
route of functional evolution is local mutation. 
Residues change as a protein evolves to satisfy modified 
functional constraints, while the basic biochemical 
mechanism and the overall three-dimensional fold remain
unaltered. In most protein families, naturally occurring
polymorphisms concentrate on residues that
modulate the specificity of biological function [41].
Improvement in both efficiency and specificity through
customization of the active-site architecture is the basic
tenet of biological evolution. Understanding how this is
achieved and compiling a comprehensive mapping be- 
tween protein folds and their related functions will be a 
major goal of structural biologists in the next few years. 
Although exploring the evolution of proteins and their 
functions in light of structural data is only just begin- 
ning, some fundamental relationships between folds and 
functions have already been revealed [42-44]. For ex- 
ample, many analogous proteins, i.e., proteins sharing a 
common fold but not a common ancestor, are found to 
have functional sites in a common location. These locations
are called supersites by Russell et al. [42]. Probably
the most widely known supersite occurs within the
α/β (or TIM)-barrels (fig. 1), which have long been
known to bind substrates at the C-terminal end of the
β-strands forming the barrel [45, 46]. Other supersites
can be found in Rossmann-type doubly wound α/β
folds, β-propellers, and up-and-down β-barrels [42].
Ferredoxin-like folds (fig. 1) show a tendency to bind
substrates on the side of the β-sheet without α-helices
packing against it. In many of the proteins adopting
this fold, the β-sheet curves on the side without packing
α-helices to form a concave surface where substrates
bind. A common location of binding sites within 
analogous proteins suggests a structure-function rela- 
tionship of general nature. 

Elucidation of the structure and function of proteins
and their interactions, and the discovery of principles 
that provide unity to the enormous diversity of struc- 
tural data, also have a deep impact on our understand- 
ing of the complex biochemical pathways. Structure 
mediates biological recognition, both within and be- 
tween cells. The signals impinging on the cell surface are 
the inputs — the boundary conditions — that modulate a 
complex network of interactions within the cell. The 
network is far more plastic than the neural net. The 
complement of genes that can be expressed by a cell 
defines the potential network, but subnets will be se- 
lected based on the specific signals on the surface as well 
as the biochemical environment inside the cell. Much of 
the network connectivity, the links, is mediated by inter- 
actions between proteins. Complex systems analysis [47] 
teaches us that, depending on its topology (connectiv- 
ity), the qualitative behavior of a network can change 
(e.g., a different group of genes could be induced, the 
cell fate could be altered). In this way, the behavior of 
the cell can be seen to be qualitatively dependent on the 
local structure of one or more proteins. Selectively 
targeting these molecules based on knowledge of their
structures would provide a way to control cell behavior,
an approach that will reach its full power when a 
sufficient number of structures are available. 



Concluding remarks 

The first protein fold classification, guided by the visual 
recognition of recurrent folding patterns, dates back to 
the late 1970s [8, 9]. This work has been taken much
further recently, with several systems available that 
provide comprehensive classifications of all experimen- 
tally determined structures. In parallel, many automatic 
procedures have been developed that recognize struc- 
tural similarities between proteins [3]; some of these 
procedures [48-50] are now used routinely to compare 
a newly solved protein structure with structures in the 
PDB. This bevy of structural bioinformatic tools has 
been used to infer ancient evolutionary relationships 
[11, 12] and to suggest functional mechanisms for hypothetical
proteins [51].

Fold classifications are only the first step toward a 
global and comprehensive understanding of protein 
folds. To reveal the principles underlying the design of 
protein folds and find answers to many other funda- 
mental questions in structural biology requires a deeper 
understanding of the relationships among protein folds. 
The focus will be the high-order substructural motifs 
that are the common building blocks of many protein 
folds. These motifs organize the fold space into distinct 
attractors. The available data show that perhaps a 
majority of these motifs have already been observed in 
the known protein structures [26, 29, 34, 52]. More 
sensitive methods for recognizing such motifs will be of 
great value. Given the importance of topology in deter- 
mining the protein-folding mechanism, complete knowl- 
edge of the core folding units will facilitate the 
development of more effective methods for protein 
structure prediction. 

The estimates made here and elsewhere [19, 53-56] 
suggest that there may be a limited number of folds 
available to proteins. However, because of the skewed 
distribution of proteins among folds, the effort to com- 
pletely elucidate all existing folds will benefit greatly 
from a definite strategy that can maximize the informa- 
tion return from experimental structure determination. 
A joint sequence and structural classification of protein 
families offers powerful clues for judiciously choosing 
novel targets [20, 21, 57-60]. 

Structural information is becoming an indispensable
component of our understanding of a variety of biolog- 
ical phenomena. As more structures are determined, our 
understanding of how function is modulated by se- 
quence changes will improve. Molecular systematics 
based on protein structure provides an effective way to 
organize the large body of functional data and to search
for unified principles. In the foreseeable future, emergent
cell properties and their control by human intervention 
will be traced directly to protein structure and its 
modulation. 